Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions

Authors

  • Gregorio Bernabé
  • José M. García
  • José González
Abstract

Video compression algorithms based on the 3D wavelet transform obtain excellent compression rates at the expense of huge memory requirements, which drastically affect the execution time of such applications. Our objective is to enable real-time video compression based on the 3D fast wavelet transform. We show the hardware and software interaction for this multimedia application on a general-purpose processor. First, we mitigate the memory problem by exploiting the memory hierarchy of the processor using several techniques. For instance, we implement and evaluate the blocking technique. We present two blocking approaches in particular, cube and rectangular, which differ in the way the original working set is divided. We also propose reusing previous computations in order to decrease the number of memory accesses and floating-point operations. Afterwards, we present several optimizations that cannot be applied by the compiler due to the characteristics of the algorithm. On the one hand, the Streaming SIMD Extensions (SSE) are used for some of the dimensions of the sequence (y and time) to reduce the number of floating-point instructions by exploiting data-level parallelism. Then, we apply loop unrolling and data prefetching to specific parts of the code. On the other hand, the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show speedups of 5x in execution time over a version compiled with the maximum optimizations of the Intel C/C++ compiler, while maintaining the compression ratio and the video quality (PSNR) of the original encoder based on the 3D wavelet transform. Our experiments also show that allowing the compiler to perform some of these optimizations (e.g., automatic code vectorization) causes a performance slowdown, demonstrating the effectiveness of our optimizations.
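To make the idea of vectorizing by columns for the y dimension concrete, here is a minimal sketch in C. It is not the authors' code: the abstract does not say which wavelet filter is used, so an unnormalized Haar-style averaging/differencing step is assumed as a stand-in, and the function names `haar_y_scalar` and `haar_y_sse`, the row-major frame layout, and the separate `low`/`high` output buffers are all hypothetical. The point it illustrates is that a filter along y combines values from different rows, so four adjacent columns can be processed with a single SSE instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86 / x86-64 only) */

/* Scalar reference (assumed Haar-style step, not the paper's filter).
   frame is height x width, row-major. Each row pair (2k, 2k+1) yields
   low[k] = (a + b) / 2 and high[k] = (a - b) / 2 for every column. */
void haar_y_scalar(const float *frame, float *low, float *high,
                   size_t width, size_t height)
{
    assert(height % 2 == 0);
    for (size_t k = 0; k < height / 2; ++k) {
        const float *a = frame + (2 * k) * width;  /* even row */
        const float *b = a + width;                /* odd row  */
        for (size_t x = 0; x < width; ++x) {
            low[k * width + x]  = 0.5f * (a[x] + b[x]);
            high[k * width + x] = 0.5f * (a[x] - b[x]);
        }
    }
}

/* Same computation, vectorized by columns: each SSE register holds the
   values of four adjacent columns, so one add/sub/mul handles four
   independent y-direction filter steps at once. */
void haar_y_sse(const float *frame, float *low, float *high,
                size_t width, size_t height)
{
    assert(height % 2 == 0);
    const __m128 half = _mm_set1_ps(0.5f);
    for (size_t k = 0; k < height / 2; ++k) {
        const float *a = frame + (2 * k) * width;
        const float *b = a + width;
        float *lo = low + k * width;
        float *hi = high + k * width;
        size_t x = 0;
        for (; x + 4 <= width; x += 4) {
            /* Unaligned loads keep the sketch safe for any buffer. */
            __m128 va = _mm_loadu_ps(a + x);
            __m128 vb = _mm_loadu_ps(b + x);
            _mm_storeu_ps(lo + x, _mm_mul_ps(half, _mm_add_ps(va, vb)));
            _mm_storeu_ps(hi + x, _mm_mul_ps(half, _mm_sub_ps(va, vb)));
        }
        for (; x < width; ++x) {  /* scalar tail for leftover columns */
            lo[x] = 0.5f * (a[x] + b[x]);
            hi[x] = 0.5f * (a[x] - b[x]);
        }
    }
}
```

Both functions take an input frame of `width * height` floats and two output buffers of `width * (height / 2)` floats each. Because adjacent columns are contiguous in a row-major frame, the loads need no shuffling; vectorizing along x would instead require combining neighbouring elements within one register.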


Similar resources

Fast Trigonometric Functions Using Intel’s SSE2 Instructions

The goal of this work was to answer one simple question: given that the trigonometric functions take hundreds of clock cycles to execute on a Pentium IV, can they be computed faster, especially given that all Intel processors now have fast floating-point hardware? The streaming SIMD extensions (SSE/SSE2) in every Pentium III and IV provide both scalar and vector modes of computation, so it has ...


Short-Vector SIMD Parallelization in Signal Processing

Short-vector Single-instruction-multiple-data (SIMD) units have become common in signal processors. Moreover, almost all modern general-purpose processors include SIMD extensions, which makes SIMD also important in high performance computing. This chapter gives an overview of approaches to the vectorization of signal processing algorithms. Despite their complexity, these algorithms have a relat...


Architectural Support for 3D Graphics in the Complex Streamed Instruction Set

In this paper we extend the previously proposed Complex Streamed Instruction Set (CSI) architecture to provide for floating-point computations and conditional execution in order to efficiently support 3D graphics applications. The CSI extension is evaluated using an industry-standard 3D benchmark and compared to Intel’s Streaming SIMD Extension (SSE). Compared to a 4-way issue superscalar ...


Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension

The discrete Fourier transform (DFT) and its fast algorithms (fast Fourier transforms or FFTs) are among the most important computational building blocks in signal processing and scientific computing. Consequently, there are a number of high-performance DFT libraries available, including Intel’s Integrated Performance Primitives (IPP), FFTW [6], and libraries generated by Spiral [9, ...


An Implementation of Parallel 1-D FFT Using SSE3 Instructions on Dual-Core Processors

In the present paper, an implementation of a parallel one-dimensional fast Fourier transform (FFT) using Streaming SIMD Extensions 3 (SSE3) instructions on dual-core processors is proposed. Combination of vectorization and the block six-step FFT algorithm is shown to effectively improve performance. The performance results for one-dimensional FFTs on dual-core Intel Xeon processors are reported...



Journal:
  • VLSI Signal Processing

Volume 41  Issue 

Pages  -

Publication date: 2005